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Despite being a paradigm of quantitative linguistics, Zipf’s law for words suffers from three main 
problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical 
point of view, and it has not been confronted to a representatively large number of texts. So, we 
can summarize the current support of Zipf’s law in texts as anecdotic. We try to solve these issues 
by studying three different versions of Zipf’s law and fitting them to all available English texts in 
the Project Gutenberg database (consisting of more than 30 000 texts). To do so we use state-of-the 
art tools in fitting and goodness-of-fit tests, carefully tailored to the peculiarities of text statistics. 
Remarkably, one of the three versions of Zipf’s law, consisting of a pure power-law form in the 
complementary cumulative distribution function of word frequencies, is able to fit more than 40 % 
of the texts in the database (at the 0.05 significance level), for the whole domain of frequencies 
(from 1 to the maximum value) and with only one free parameter (the exponent). 

PACS numbers: 


INTRODUCTION 

Zipf’s law constitutes a striking quantitative regularity 
in the usage of language W- It states that, for a large 
enough piece of text, the frequency of use n of any word 
decreases with its rareness r in the text in an approxi¬ 
mately hyperbolic way, i.e., n oc l/r, where the symbol 
“oc” denotes proportionality. Technically, r is called the 
rank, and the most common (i.e., less rare) word is as¬ 
signed r = 1, the second most common, r = 2, and so on. 
A slightly more general formulation includes a parameter 
in the form of an exponent a; then, the rank-frequency 
relation takes the form of a power law, 

n oc —. (I) 

'' ' 

with the value of a close to one. 

This pattern Q has been found across different lan¬ 
guages, literary styles, time periods, and levels of mor¬ 
phological abstraction [IISIIS]. More fascinatingly, the 
same law has been claimed in other codes of communica¬ 
tion, as in music [7] or for the timbres of sounds [S] , and 
also in disparate discrete systems where individual units 
or agents gather into different classes [9], for example, 
employees into firms m, believers into religions m , in¬ 
sects into plants |12j . units of mass into animals present 
in ecosystems m, visitors or links into web pages El: 
telephone calls to users [15], or abundance of proteins 
(in a single cell) [Hj. The attempts to find an explana¬ 
tion have been diverse nnninHiii, but no solution has 
raised consensus |5D1 [53 [55]. 

Despite its quantitative character, Zipf’s law has been 
usually checked for in a qualitative way, plotting the log¬ 
arithm of the frequency n versus the logarithm of the 
rank r and looking for some domain with a roughly lin¬ 
ear behavior, with slope more or less close to —1. A more 


refined approach consists in fitting a straight line to the 
double-logarithmic plot by linear regression m- But 
several authors have recently pointed out the limitations 
of this method when applied to probability distributions 
[niiiHin], and the advantages of using an asymptoti¬ 
cally unbiased and minimum-variance procedure such as 
maximum likelihood (ML) estimation |30j . whose solu¬ 
tions, moreover, are invariant under reparameterizations 
EH 132]. One should consider then ML estimation as 
the most reliable procedure of estimation for parametric 
models (when a maximum of the likelihood does exist 
and the number of data is large). 

Furthermore, for the particular case of linguistics, the 
search for Zipf’s law has been traditionally performed in 
very limited sets of texts (less than a dozen in a typical 
research article |il33]). More recently, however, large 
corpora have been considered -these are representative 
collections of different texts aggregated together into a 
single bag, so, instead of many separated texts one deals 
with one enormous mixed text. When “rare” words are 
not considered (i.e., for high frequencies), it seems that 
Zipf’s law still holds in these large collections [ai33H37]. 
At present, there is agreement that Zipf’s law is a rough 
approximation in lexical statistics, but its range of va¬ 
lidity is totally unknown, i.e., we ignore how good Zipf’s 
law is in order to account for the appearance of words, 
and for which texts it should work -and with which level 
of precision- and for which texts it should fail. 

An extra difficulty emerges when one recognizes the ill- 
defined nature of Zipf’s law. In fact, the law has two 
formulations, with the first one being Eq. 0: which just 
counts the frequency of words. For the sake of clarity, 
the words that are counted are referred to as word types, 
in order to distinguish them from each repetition, which 
is called a token. The second formulation of Zipf’s law 
arises when, after counting the frequency of word types. 
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one performs a second statistics and counts how many 
values of the frequency are repeated, that is, how many 
word types have the same frequency. This means that 
the frequency n is considered the random variable. One 
can realize that the rank, when normalized by its max¬ 
imum value in text, is just the empirical estimation of 
the complementary cumulative distribution function of 
n, and then, the derivative of the expression for r{n) (the 
inverse of Eq. 0 ) yields a continuous approximation for 
the probability mass function /(n) of the frequency n. 
From here one obtains another power law, 

/(n) cx (2) 

with the new exponent /3 fulfilling /3 = 1 + \/a, which 
yields values of /3 close to 2. The expression given by 
Eq. 0 was in fact the first approach followed by Zipf’s 
himself [3] , and is usually considered as equivalent to Eq. 
0 0 1 Ha [13 ES] ; however, as it is derived in the 
continuum limit, both expressions can only be equiva¬ 
lent asymptotically, for large n [38] . Consequently, if one 
wants to be precise, a natural question follows: which 
one is the “true” Zipf’s law (if any)? 

We cannot know a priori which of the two Zipf’s laws 
better describes real texts, but we can argue which of 
the two representations (that of n(r), Eq.0, or that of 
/(n), Eq.(0) is better for statistical purposes, indepen¬ 
dently of the functional dependency they provide. It is 
clear that the rank-frequency representation, given by 
n(r), presents several difficulties, due to the peculiar na¬ 
ture of the rank variable. First, in Ref. [39|, Zipf-like 
texts were randomly generated following Eq. 0. keep¬ 
ing the original ranks “hidden” (as it happens in the real 
situation), and it was found that the rank reconstructed 
from the sample deviated considerably from the original 
ranks when these were taking large values (which for a 
power law happens with a high probability). The result¬ 
ing ML estimations of the exponent a were highly biased 
and the Kolmogorov-Smirnov test rejected the power-law 
hypothesis, although the original ranks were power-law 
indeed. 

One might argue that the problem could be escaped by 
using an upper truncated power-law distribution (intro¬ 
ducing then an additional parameter for the truncation), 
in order to avoid the inconsistency of the rank repre¬ 
sentation for high values. But a second problem is that 
the rank is not a true random variable [40] . as its values 
are assigned a posteriori, once the sample (i.e., the text) 
is analyzed. This means that the rank does not show 
“enough” statistical fluctuations, that is, if then 

the frequency of a is always larger, by construction, than 
the frequency of b. This does not necessarily happen 
for a true random variable. The negative correlation be¬ 
tween the variable and its frequency of occurrence makes 
the power-law hypothesis harder to reject. In fact, in¬ 
flated p-values (not uniformly distributed between 0 and 


1) have been found when fitting truncated power laws 
to simulated power-law rank-frequency representations 
[39] . This problem could still be avoided by choosing a 
low enough upper truncation parameter (yielding a very 
short range of ranks, for which the fluctuations would be 
very little) but at the expense of disregarding an impor¬ 
tant part of the data. 

A third inconvenience is the impossibility, due to normal¬ 
ization, that a non-truncated power law comprises values 
of the a—exponent smaller than 1. This yields the neces¬ 
sity of introducing a truncation parameter that may be 
artifactual, i.e., not present in the real system. All this 
leads to the conclusion that the most reliable method of 
parameter estimation (ML, in a frequentist framework) 
cannot be directly applied to the rank-frequency repre¬ 
sentation. In contrast, the representation in terms of the 
distribution of frequencies is devoid of these problems 
[39] , as n is a well-dehned random variable, and this will 
be the representation used in this paper for statistical in¬ 
ference. Nevertheless, for alternative arguments, see Ref. 

EH- 

The purpose of this paper is to quantify, at a large, 
big-data scale, different versions of Zipf’s law and their 
ranges of validity. In the next section, we present and jus¬ 
tify the three Zipf-like distributions we are going to fit, 
and we briefly explain the selected fitting method and the 
goodness-of-fit test. The corpus of texts under consider¬ 
ation is also detailed. The following section presents the 
results, with special attention to their statistical signifi¬ 
cance and their dependence with text length. Finally, we 
end with the conclusions and some technical appendices. 

ZIPF-LIKE DISTRIBUTIONS 

As implicit in the introduction, and in contrast with 
continuous random variables, in the discrete case a 
power law in the probability mass function f{n) does 
not lead to a power law in the complementary cumu¬ 
lative distribution or survival function S{n), and vice- 
versa. Let us specify our definition for both functions, 
/(n) = Prob[frequency = n] (as usual), and S{n) = 
Prob[frequency > n] (changing, for convenience, the 
usual strict inequality sign by the non-strict inequality). 
Then, the relation between both is f{n) = S'(n) —S'(n-l-l) 
and 5(n) = E“=„/(ri')- 

We consider that the values the random variable takes, 
given by n, are discrete, starting at the integer value a, 
taking values then n = a, a -I-1,... up to infinity. In this 
study we will fix the parameter a to a = 1, in order to 
fit the whole distribution and not just the tail. Then, al¬ 
though for large n and smooth S (n) we may approximate 
f{n) ~ —dS{n)/dn, this simplification, which lies in the 
equivalence between Eqs. 0 and 0, is clearly wrong 
for small n. 

For the first distribution that we consider, the power-law 
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form is in /(n), then, 


whereas 


fi{n) 


1 

C(/3,a)n/5' 


(3) 


P / \ n—>-oo 

fsin) -^ 


w-ma) 

r(o + 1 — /3)n^ 


This is just the normalized version of Eq. ([^, and then, 


S'i(n) 


({13, n) 
C(/3,a) 


with /3 > 1 and ({j3,a) = {^ 3 - k)~3 denotes the 

Hurwitz zeta function, which ensures normalization of 
both expressions. A preliminary analysis in terms of this 
distribution was done in Ref. [32] • In contrast, when the 
power law is in S{n), this leads to our second case. 


and 




(4) 


with j5 > \ again. Note that this corresponds to a power 
law in the empirical rank-frequency relation. 

Finally, it is interesting to consider also the frequency 
distribution derived by Mandelbrot |38] when ranks are 
generated independently from a power law in Eq. Q, 
which is. 


(/3-l)r(a) r(n+l-/3) 
r(a + l-,d) r(n+l) 


(5) 


and 


S 3 {n) 


r(a)r(n-b 1 -/3) 
r(n)r(a-b 1 -/3) ’ 


with 1 < /3 < 2 and r( 7 ) = x"^~^e~^dx denotes the 

gamma function |43] . In this case the power law is the un¬ 
derlying theoretical rank-frequency relation n{r). Note 
that f 3 {n) can be written as 


B{n + 1- l3,/3) 

/ 3 (^) - Bia + l-P,P-l) 

using the beta function [43], B{x,y) = r(a;)r(y)/r(a; -b 
y), with an analogous expression for 83 ( 71 ), but do not 
confuse this distribution with the beta distribution. 

In all three cases it is easy to show that we have well- 
defined, normalized probability distributions, when n 
takes values n = a, a -b 1,..., with a being a positive 
integer. Moreover, in the limit n —>■ 00 all of them yield 
a power-law tail, f{n) oc 1/n^, so /3 will be referred to as 
the power-law exponent. Indeed, it is easy to show that 

h{n)3^{l3-l)^, 


using Stirling’s formula [33|. The main difference be¬ 
tween the three distributions is in the smaller values of 
n, taking f 2 {n) a convex shape in log-scale (as seen “from 
above”); fsin) a concave one; and fi{n) being somehow 
in between, as it is neither concave nor convex. 


METHODOLOGY AND DATA 

In order to fit these three distributions to the different 
texts, and test the goodness of such fits, we use maximum 
likelihood estimation [31] followed by the Kolmogorov- 
Smirnov (KS) test |3S]. The procedure seems similar to 
the one proposed in Ref. m, but as a is fixed here, 
the problems resulting from the search of the optimum a 
[321SS] do not arise in this case. 

The method of ML estimation proceeds in the following 
simple way. Given a set of data {ni\ with i = 1, 2,... A, 
and a probability mass function parameterized by /3, de¬ 
noted as f{n-,j3) = f{n), the log-likelihood function is 
obtained as 

N 

^(/3) = (6) 

i=l 

We are assuming that the data points are indepen¬ 
dent from each other, in other words, we are calculating 
the likelihood that the data are generated independently 
from /(n;/3). The ML estimation of (3 is obtained as 
the value of /3 which maximizes /(/?); we undertake this 
numerically, using Brent’s method [33] . In the case of 
the distribution fi the log-likelihood function takes the 
simple form li[/3)/N = — ln(^(/3, a)) — /31nG, with G the 
geometric mean of the set {rii}. For the other distribu¬ 
tions no closed-form expression is possible and we use Eq. 
(§ directly. 

As mentioned, the goodness-of-fit test is done through 
the Kolmogorov-Smirnov statistic [His], in the discrete 
case m, for which the p-value is calculated from Monte 
Carlo simulations (due to the fact that, as the value of the 
exponent is calculated from the same data is going to be 
tested, the theoretically computed p-value [31] would be 
positively biased). In this paper we use 100 Monte Carlo 
simulations for each test. The proper simulation of the 
3 distributions is explained in the Appendix. Remem¬ 
ber that a small enough p-value leads to the rejection of 
the fit. Although we perform multiple testing, we do not 
incorporate any Bonferroni-like correction [4M50] . due 
to the fact that these corrections increase the number of 
non-rejected null hypothesis (that is, decrease the num¬ 
ber of type I errors), inflating the performance of the fits. 
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in the case of goodness-of-fit tests. Without Bonferroni- 
like corrections, our acceptance (i.e., non-rejection) of the 
fits is more strict. 

In order to apply this methodology we consider a set 
of 37041 texts stored in UTF-8 encoding in the Project 
Gutenberg database (accessed July 2014 [H]). These 
texts correspond to different languages, styles, and time 
periods, although most of them are works of literature 
from the Western cultural tradition [52]. First of all, 
parts of text that do not pertain to the piece under con¬ 
sideration (copyright notes, headers,...) are removed by 
an automatized process. Books that have not been fil¬ 
tered in this step (mainly because they do not have stan¬ 
dard delimiters) are discarded. After this, we still keep 
97.5% of the total (i.e., 36108). To perform our study, 
we restrict ourselves to the subset of texts in English, 
which represent the 86% of these 36108 (i.e., 31102). 
An important characteristic of each text is its length, L, 
counting the number of word tokens it contains. It turns 
out to be that in the database L expands from very small 
values up to 4659 068 tokens, with a distribution that is 
shown in Fig. [^ Observe the roughly uniform distribu¬ 
tion up to about L = 10®, and the decay afterwards. 
We consider only the 31 075 English texts that consist of 
more than 100 word tokens, as smaller texts would not 
have statistical value. For each of these texts we select 
then actual word types (punctuation signs, numbers and 
any character different from letters are not considered) 
to count their frequencies n, which will be our primary 
object of study. 

The values of these frequencies, for each text, are avail¬ 
able on http://dx.doi.org/10.6084/m9.figshare.1515919, 
in order to facilitate the reproducibility of our results. 

In summary, we apply the above described fitting and 
goodness-of-ht procedure -using ML estimation and the 
Kolmogorov-Smirnov test- to a total of 31 075 texts from 
the English Project Gutenberg database, using three dif¬ 
ferent possibilities for the distribution of frequencies: /i 
(Eq. @), h (Eq. Q), and h (Eq. (§). This yields a 
total of 3 X 31075 fits and associated p-values, which we 
analyze and interpret in what follows. 

RESULTS 

Gontrary to previous studies where the number of texts 
considered was, at most, in the order of tens, the large- 
scale approach taken in this work requires a statistical 
analysis of the fitting results, as a case-by-case interpre¬ 
tation is out of hand. We first focus on the distribution 
of p-values, see Fig. and Fig. If all texts were truly 
generated by a mechanism following a given distribution, 
the corresponding p-values for that distribution would be 
uniformly distributed between zero and one. As seen in 
Fig.i this is not the case and, furthermore, most texts 
have rather small p-values for the three fits; nevertheless. 



FIG. 1: Estimation of the probability density function of text 
length L in the English Project Gutenberg database, using 
logarithmic binning (5 bins per decade). Texts with less than 
100 tokens are not considered. A power-law fit of the tail |32| 
yields an exponent 2.92 ± 0.15. 

for distributions fi and /2 there are still many texts that 
yield high enough p-values. This implies that, although 
we cannot conclude that the whole database is generated 
by any of these distributions, these cannot be rejected as 
good descriptions for large subsets of the database. Re¬ 
garding distribution f^, it is clear from the histogram of 
p-values that it can be discarded as a good description 
of the distribution of frequencies in any non-negligible 
subset of texts. So, from now on, we will concentrate on 
the remaining options, fi and / 2 , to eventually quantify 
which of these better describes our corpus. In essence, 
what we are interested in is which version of Zipf’s law, 
either distribution /i or / 2 , fits better a reasonable num¬ 
ber of texts, and which range of validity these simple 
one-parameter distributions have. 

The outcome is that, independently of the significance 
level (as long as this is not below our resolution of 0.01 
given the number of Monte Carlo simulations), the ra¬ 
tio between the number of texts fitted by distribution /2 
and those fitted by fi is nearly constant, taking a value 
around 2.6. For example, considering significance level 
(i.e., minimum p-value) equal to 0.05, Fig. shows that 
distribution /2 fits about 40% of all texts, whereas dis¬ 
tribution /i fits just 15%. Both percentages include a 
2.7% of texts that are fitted by both distributions simul¬ 
taneously, although this number does not keep a con¬ 
stant ratio with the other two, rather, it decreases when 
the significance level is increased (as it is implicit in the 
values of Fig. [^. Given that the aforementioned ratio 
of 2.6 is independent of the significance level, it is fair 
to say that distribution /2 provides, compared to /i, a 
better description of our database. As a visual illustra¬ 
tion of the performance of the fits we display in Fig. 
the word frequency distribution of the longest texts that 
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have p > 0.5, for distributions /i, /2 and /a. 



0 0.2 0.4 0.6 0.8 1 
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FIG. 2: Histograms of p-values obtained when the Zipf- 
like distributions /i, /a, and /a are fitted to the texts of the 
English Project Gutenberg. The histograms just count the 
number of texts in each bin of width 0.01. Note the poor 
performance of distribution 3 and the best performance of 
2. Power-law approximations to the histograms for /i and /a, 
with respective exponents 0.74 and 0.78, are shown as a guide 
to the eye. 
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FIG. 3: Gomplementary cumulative distributions (i.e., sur¬ 
vival functions) of p-values obtained when our three distribu¬ 
tions are htted to the texts of the English Project Gutenberg. 
This corresponds, except for normalization, to the integral of 
the previous figure, but we have included a fourth curve for 
the fraction of texts whose p-values for fits 1 and 2 are both 
higher than the value marked in the abscissa. Note that the 
values of p can play the role of the significance level. The value 
for p = 0 is not shown, in order to have higher resolution. 

The next question we address is the dependence of the 
performance of fits on text length L. In order to asses 
this, note that from the shape of the histograms in Fig.[^ 
we can distinguish two groups of texts: those that lie in 


FIG. 4: Complementary cumulative distribution and proba¬ 
bility mass function of texts: (a) A Chronicle of London, from 
1089 to 1483■, (b) The letters of Charles and Mary Lamb, 
edited by E. V. Lucas; (c) A Popular History of France from 
the Earliest Times, Vol. I, by E. Guizot. These texts are 
the ones with the largest length L (83 720, 239 018 and 2 081 
respectively) of those that fulfill p > 1/2, for hts 1, 2 and 
3 respectively. The exponent /3 takes values 1.96, 1.89, and 
1.82, in each case. 
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the zero bin (whose p-value is strictly less than 0.01), and 
the rest. Taking the last group, i.e., texts with p > 0.01, 
and partitioning it into different subsets according to text 
length (i.e., looking at the distribution of p conditioned to 
p > 0.01 for different ranges of L), it holds that the shape 
of the resulting distribution of p does not strongly depend 
on L, as shown in Fig.[^ In contrast, the number of texts 
that yield p-value near zero certainly varies with L, see 
Fig.|^ Therefore, in order to compare the performances 
of fi and /2 as a function of the text length L, it is enough 
to consider a single value of the significance level (greater 
than zero) as the results for any other significance level 
will be the same, in relative terms. 

Indeed, Fig. shows how distribution /i fits some 
more texts than distribution /2 for small values of L, up 
to about 13 000 tokens. But for larger texts, distribution 
/2 clearly outperforms distribution fi , which becomes ir¬ 
relevant for L beyond 100 000 (at 0.05 significance level), 
whereas distribution /2 is able to fit many texts with L 
larger than 200 000. The figure shows that this is the 
case no matter if the significance level is 0.05, 0.20, or 
0.50; the collapse of the curves in Fig. [^b) confirms this 
fact. From Fig.j^one could infer the same for significance 
level equal to 0.01. This stability of the performance of 
the fits for different significance levels arises from the ob¬ 
served fact that the distributions of p-values (conditioned 
to p > 0.01) are nearly identical for different L, as shown 
in Fig. 

In order to check the consistency of our fitting procedure, 
we also perform a direct comparison of models through 
the likelihood ratio (LR) test [TTl |23]. We apply this 
test to all texts that have been fitted, considering 0.05 
as significance level, by at least one of the two distribu¬ 
tions fi and / 2 . Then the log-likelihood-ratio between 
distributions fi and /2 is 

N 

^1,2 = X! (ln/l(?^z) - ln/2(?T.i)) : 

i=l 

and, under the null hypothesis that both models are 
equally good to describe the data, i?i ^2 should be nor¬ 
mally distributed with zero mean and a variance that can 
be estimated as TVcr^, with the variance of the random 
variable ln/i(n) — ln/ 2 (n). Large absolute values of Ri ,2 
will lead to the rejection of the null hypothesis. 

Table |l] merges the results of the LR test and our previ¬ 
ous procedure (based on ML estimation plus the KS test). 
The total number of texts previously fitted by /i or/and 
/2 is displayed depending on the sign of the correspond¬ 
ing log-ratio i?i, 2 - However we must take into account 
that the sign of the obtained value of i?i ,2 could be a 
product of just statistical fluctuations if the true value 
were zero and thus, the sign of i?i ^2 cannot be trusted in 
order to discriminate between two models. The proba¬ 
bility under the null hypothesis, of obtaining an absolute 
value of the log-ratio greater than the empirical value 



P 
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FIG. 5; Estimated probability density functions of p-values 
conditioned to p > 0.01 separating for different ranges of text 
length L. p-values correspond to the fitting of word frequen¬ 
cies to (a) distribution fi and (b) distribution / 2 . We di¬ 
vide the distribution of text length into 15 intervals of 2 000 
texts each. For distribution fi only the first seven groups 
(up to length 34 400) are displayed (beyond this value we do 
not have enough statistics to see the distribution of p-values 
greater than 0.01, as displayed in Fig. for distribution 2 
this happens only in the last two groups). The intervals Li 
range from Li = [115,5 291] to hs = [89 476,103 767]. 


|i?i_ 2 | is computed through: 


Plr = erfc 


1^1, 2 I 

V27Vcr2 


where erfc is the complementary error function [43] . We 
will take as statistically significant those Ri ^2 that yield 
Plr < 0.05. Equivalently, at 0.05 significance level, Ri ,2 
is significant if its absolute value is greater than Rc = 
l.ObV Na"^. The results are shown in Table |ll| 

Note that the LR test cannot conclude if a fit is good or 
bad, as it only compares the relative performance of two 
fits; in other words, if the LR test selects a particular 
distribution, that distribution can still yield a bad fit, in 
absolute terms. Anyway, there is no mismatch between 
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FIG. 6: Number of texts with p-value near zero {p < 0.01) 
in different ranges of L divided by the number of texts in the 
same ranges, for the fits of distributions fi and / 2 . Values of 
L denote the geometric mean of ranges containing 1000 texts 
each. The higher value for fit 1 (except for L below about 
13000 tokens) denotes its worst performance. 

the results of both tests: any time the ML-KS method 
selects one distribution over the other, the LR test ei¬ 
ther supports the selection or does not give significant 
results, but it never selects the other option (as shown in 
Table |n|. 




^1,2 > 0 

^1,2 < 0 

Total ML-KS 

/i (exclusively) 

3614 

81 

3695 

/2 (exclusively) 

120 

11366 

11486 

/i and /2 

431 

398 

829 

Total LR 

4165 

11845 

16010 


TABLE I: The number of texts that are fitted by /i or /2 
or both at 0.05 significance level of the ML-KS procedure, 
separated into two columns according to the sign of Ri, 2 . 
Positive i?i,2 means that the likelihood for fi is greater than 
that for / 2 , and conversely for negative i?i, 2 . Nevertheless, the 
sign of i?i ,2 is not an indication of significance, for signihcant 
LR tests see Table HIl 


Taking now those texts whose frequency distributions 
could be approximated by fi or / 2 , we draw attention 
to the distribution of the estimated exponents (i.e., the 
parameter /3). The original formulation of Zipf’s law im¬ 
plies /3 = 2 and Fig. shows that /3 is certainly dis¬ 
tributed around 2, with a bell-like shape. If we check the 
effect of the text length L in the distribution of /3, we find 
a decreasing trend of (3 with L, as can be seen in Fig. 
We have tested that this observation is not an artifact 


FIG. 7: (a) Histograms showing the fraction of accepted 

texts by the three distributions as a function of their text 
iength, for three different signihcance ieveis po: 0.05 (upper 
curves), 0.20 (middle), 0.50 (lower). To be concrete, for each 
range of L, the ratio between the number of texts with p> po 
and the number of texts in that range is calculated, (b) Same 
curves (removing those for distribution 3) under rescaling. 
We rescale the j/—axis by the number of p > po, in each case, 
showing that the relative performance of each fit with regard 
L is independent on the signihcance level. Bins are selected 
to contain 1000 texts each. 



Ri ,2 > Rc 

A 

1 

fi (exclusively) 

1666 

0 

/2 (exclusively) 

0 

9423 

fi and /2 

0 

3 

Total LR test 

1666 

9426 

None (neither /i nor / 2 ) 

510 

11431 


TABLE II: Number of texts with a signihcant LR test, at 
the 0.05 level, either favouring distribution /i {Ri,2 > Rc) or 
distribution /2 {Ri,2 < —Rc), for diherent outcomes of the 
ML-KS procedure (at the 0.05 level also). Note that these 
cases correspond to a subset of the previous table. An addi¬ 
tional row shows the number of texts that are htted neither 
by distribution fi nor / 2 ; notice that in this case a signihcant 
LR test does not guarantee a good ht. 
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FIG. 8: Estimation of the probability density of the expo¬ 
nent /3 for texts yielding p > 0.05 in fit 1 and fit 2. Curves 
have been calculated from the histograms via normal ker¬ 
nel smoothing method as implemented in MatLab (ksdensity 
function). Estimated mean and standard deviation are 2.03 
and 0.15 respectively for fit 1, and 2.02 and 0.17 for fit 2. 


of the fitting method, as synthetic texts generated with 
fixed /3 do not show this behavior. We have no theoreti¬ 
cal explanation for this fact, but notice that this trend is 
not in disagreement with the claims of Ref. [33], where 
the stability of the exponent /3 was demonstrated for a 
single growing text (i.e., comparing small parts of a text 
with the whole). 


CONCLUSIONS 

Zipf’s law is probably the most intriguing and at the 
same time well-studied experimental law of quantitative 
linguistics, and extremely popular in its wider sense in 
the science of complex systems. Although the previous 
literature is vast, as far as we know our work constitutes 
the first large-scale analysis of Zipf’s law in single (non- 
aggregated) texts. Thus, we are in a position to make a 
well-grounded statement about the validity of Zipf’s law 
in such texts. 

Let us first briefly summarize, however, some key tech¬ 
nical points of our study. First, we have analyzed a 
total of 31 075 English texts from the Project Guten¬ 
berg database using rigorous fitting procedures, and have 
tested how well they are described by three Zipf-like dis¬ 
tributions. Our choice of distributions has not been ex¬ 
haustive; rather, we have limited ourselves to different in¬ 
terpretations of what can be understood as “Zipf’s law”, 
in the sense of having a perfect power law either in the 
probability mass function of word frequencies, or in the 
complementary cumulative distribution function (whose 
empirical estimation leads to the rank-frequency relation 
of the sample), or in the rank-frequency relation of an 
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FIG. 9: Estimated probability density of /? for fits with 
p > 0.05, in different length ranges. We have divided both 
groups of accepted texts into 4 percentiles according to L. As 
in the previous figure, the normal kernel smoothing method 
is applied, (a) For distribution /i. (b) For distribution / 2 . 


underlying population. Remarkably, the resulting distri¬ 
butions have a unique parameter, /3, which in all cases is 
the exponent of an asymptotic power law in the proba¬ 
bility mass function of the frequency. It is left to explore 
how other, more complicated extensions of Zipf’s law per¬ 
form on this large corpus, but it is obvious that, by in¬ 
cluding additional parameters, one might provide good 
fits to a larger number of texts (although in this case, 
proper model selection will require to balance number of 
parameters and parsimony). 

Our aim in this paper has not been to fit as many texts as 
possible, but to test the performance of the simplest Zipf- 
like distributions within a very strict, conservative frame¬ 
work. Indeed, by requiring the three versions of Zipf’s 
law to hold on the full range of frequencies n = 1 , 2 ,... 
(and not only on the tail of the distribution) we put our¬ 
selves in the strictest range of demands. It is hence re¬ 
markable that, e.g., at the standard significance level of 
0.05, and for text lengths between 10^ and 10® word to- 
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kens, more than 40 % of the considered texts are statisti¬ 
cally compatible with the pure power law in the comple¬ 
mentary cumulative distribution function represented by 
distribution /2 (see Fig.[^. So, we can state that, for the 
corpus under consideration, the most appropriate version 
of Zipf’s law is given by a probability mass function 

/(n) = Prob[frequency = n] = ^ 

or, equivalently, by a complementary cumulative distri¬ 
bution function 

S( n) = P rob [frequency > n] = ■ 

Due to the broad coverage of the Project Gutenberg 
corpus we speculate that this distribution should fit a 
large fraction of generic (non-technical) English texts. 
Of course, testing this speculation in front of all possible 
corpora is an impossible task. 

We have also shown that our conclusions regarding the 
relative performance of a pure power law in the prob¬ 
ability mass function, given by distribution /i, versus 
distribution /2 are robust with respect to changes in the 
significance level: about twice as many texts are statis¬ 
tically compatible with distribution /2 than those com¬ 
patible with /i, at any significance level (obviously, in 
absolute terms, the number of accepted texts varies with 
the signihcance level). Hence we can conclude that dis¬ 
tribution /2 gives a better description of English texts 
than distribution /i, at least for the corpus considered 
in this work. We also conclude that distribution /a, first 
derived by Mandelbrot [38j, is irrelevant for the descrip¬ 
tion of texts in this corpus. Finally, we have corroborated 
that the exponent /? of Zipf’s law certainly varies from 
text to text, as had been previously claimed using other 
approaches for defining what Zipf’s law is 011]. Interest¬ 
ingly, the value /? = 2 originally proposed by Zipf himself 
is among the most frequent ones. 

We believe that our analysis constitutes a major advance¬ 
ment in the understanding of Zipf’s law. It is astonishing 
how good the simplest one-parameter Zipf-like distribu¬ 
tions perform on such a large set of texts, particularly 
with the strict set of requirements we have imposed. This 
is in sharp contrast for instance with Zipf’s law in demog¬ 
raphy |54| and in the distribution of income |55j . where 
the power law seems to be valid only for the tail corre¬ 
sponding to the largest sizes, as it happens also for the 
distribution of word frequency in large text corpora, as 
mentioned above [21 . 

Zipf’s law has been subject to much debate, and will 
probably continue to be so for many years. Indeed, one 
can always cast doubt on its validity on the basis of some 
particular examples. Yet it seems clear to us that, in our 
modern times of big data and large computational ca¬ 
pabilities, more efforts should be put towards large-scale 
analysis of Zipf’s law. We hope this paper constitutes a 
first step in this direction. 


APPENDIX: SIMULATION OF DISCRETE 
ZIPF-LIKE DISTRIBUTIONS 


As part of the testing procedure, we need simulated sam¬ 
ples from /i, / 2 , and /s, which are discrete distributions 

defined for n = a, a -I- 1,_ We will give the recipe of 

simulation for an arbitrary positive integer value of the 
lower cut-off a. It is simpler to start with / 2 , as this is 
used as an auxiliary distribution in the simulation of the 
other two. 

Simulation of f 2 - Fixed a and given the parameter /3, we 
want a set of random numbers whose cumulative distribu¬ 
tion function is a discrete power law: S 2 {n) = (a/n)^ ^. 
For that, we first generate a random number u from 
a uniform distribution in the interval (0,Umax)^ with 
Umax = If we take x = it turns 

out to be that the values of x yield a continuous power 
law with S^ix) = (a/x)^ for x > a, where the super¬ 
script c distinguishes the continuous distribution from its 
discrete analogous one. So, taking n equal to the integer 
part of X, i.e., n = int(a::), yields a discrete distribution 
with S 2 {n) = (a/n)^~^, as desired. This is so because, 
for any X, int(A) > n is equivalent to X > n for n 
integer. In a recipe: 

• generate u from a uniform distribution in 
(0,1/a^-i], 

• calculate x = 

• take n = int(a;). 

By means of the so-called rejection method [56] . simu¬ 
lated integers distributed following /2 can be used for the 
simulation of integers following /i or f^. The key point 
to achieve a high performance in the rejection method is 
to use a “good” auxiliary function, i.e., one that leads 
to a low rejection rate. This is certainly the case in our 
framework, as explained below. 

Simulation of fi. In this case, the steps are: 

• generate n from / 2 (u), 

• generate v from a uniform distribution in the unit 
interval. 


• n is accepted if 


fijn 

h{n)C 


and rejected otherwise, where C is the rejection 
constant given by C = max„>a {/i(n)// 2 (u)}. 

It is easy to check that the maximum of / 1//2 is reached 
at n = a as this is a decreasing function [52]. The 
acceptance condition above can be simplified by taking 
r = (1 -I- and 6 = (1 -I- then, the con¬ 

dition becomes: 


bvn{T — 1) < a{h — l)r. 
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which is devoid of the calculation of the Hurwitz-zeta 
function. This is a generalization for a > 1 of the method 
of Ref. [56] . The choice of /2 as the auxiliary distribution 
function is justified by the small value that C takes, as 
this is the expected number of generated values of n until 
we accept one. For instance, for /3 = 2 and a = 1 we get 
C = 1.2158. 

Simulation of f^. Proceeding similarly, we get in this case 
low values of C = max„>a {f^{n)/f 2 {n)} as well (we get 
C = 2 in the limit /3 —)■ 2 for a = 1). The maximum of 
f^{n)/f 2 {n) is numerically seen to be reached at n = a. 
In summary, the steps are: 

• generate n from / 2 (n) 
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• generate v from a uniform distribution in the unit 
interval 

• n is accepted if 


vf 2 {n) < a 





r(n-(/3-l)) r(a) 
r(n + 1) r(r+ a-(3) 


and rejected otherwise. 
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